chore: migrate to master, enforce ci #1
Open
georgewhewell wants to merge 105 commits into
Collapse the remote execution stack onto the canonical graph request shape, move quote discovery into the rpc crate, and remove the CLI-only discovery split. This also folds in the executor/client fixes needed to make the new path work end-to-end.
Document the deny-by-default serve flow, surface the active policy mode at startup, and add the Nix/docker packaging helpers that make the new deployment shape usable.
- Enable `otel` feature on tonic-iroh-transport
- Register W3C TraceContextPropagator globally
- Wrap server services with TraceContextExtractor
- Wrap client RPC channels with TraceContextInjector
- Make RemoteExecuteDriver generic to support intercepted channels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
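For readers unfamiliar with W3C Trace Context, here is a minimal sketch of what an injector/extractor pair does on the wire. It is not the code from this PR: `SpanContext`, `inject`, and `extract` are hypothetical names, a plain `HashMap` stands in for request metadata, and only version `00` of the `traceparent` header is handled.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a span context; not a type from this PR.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SpanContext {
    trace_id: u128,
    span_id: u64,
    sampled: bool,
}

/// Client side: write the context into outgoing metadata as `traceparent`.
/// Layout: version "00", 32 hex trace id, 16 hex span id, 2 hex flags.
fn inject(ctx: &SpanContext, metadata: &mut HashMap<String, String>) {
    let flags: u8 = if ctx.sampled { 1 } else { 0 };
    let value = format!("00-{:032x}-{:016x}-{:02x}", ctx.trace_id, ctx.span_id, flags);
    metadata.insert("traceparent".to_string(), value);
}

/// Server side: recover the context, or `None` if the header is missing
/// or malformed.
fn extract(metadata: &HashMap<String, String>) -> Option<SpanContext> {
    let value = metadata.get("traceparent")?;
    let parts: Vec<&str> = value.split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None;
    }
    let lengths_ok = parts[1].len() == 32 && parts[2].len() == 16 && parts[3].len() == 2;
    let hex_ok = parts.iter().all(|p| p.bytes().all(|b| b.is_ascii_hexdigit()));
    if !lengths_ok || !hex_ok {
        return None;
    }
    let trace_id = u128::from_str_radix(parts[1], 16).ok()?;
    let span_id = u64::from_str_radix(parts[2], 16).ok()?;
    let flags = u8::from_str_radix(parts[3], 16).ok()?;
    // All-zero ids are invalid in the W3C Trace Context format.
    if trace_id == 0 || span_id == 0 {
        return None;
    }
    Some(SpanContext { trace_id, span_id, sampled: flags & 1 == 1 })
}
```

In the real stack the propagator and the interceptors do this work; the sketch only shows the header round trip they rely on.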
…s everywhere
OpenTelemetry plumbing is now opt-in via the `otel` feature on `hellas-cli`
(default off). With the feature off, none of opentelemetry / opentelemetry_sdk
/ opentelemetry-otlp / tracing-opentelemetry / reqwest compile, and the
trace-context propagation glue collapses to identity. Plain `tracing::info!`
/ `warn!` / etc. macros stay unconditional — they're no-op-cheap without a
subscriber.
Cfg surface is concentrated, not sprinkled:
- One function pair in `tracing_config.rs` (`install_with_otel`); registry
composition stays cfg-free. `TracerGuard` newtype hides the
`Option<SdkTracerProvider>` behind a cfg-gated field so `main.rs` drops the
`if let Some(provider) = ...` dance around shutdown.
- `execution.rs`: cfg-swap of the `TracedChannel` type alias plus a
`traced(channel)` helper collapses 8 `InterceptedService::new(channel,
TraceContextInjector)` sites and avoids spreading cfg across the file.
- `serve/node.rs`: a single `traced_service<S>` helper replaces 5
`trace_layer.layer(...)` sites.
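A rough sketch of the concentrated-cfg pattern described above, under stated assumptions: `Provider` and `Injected` are hypothetical stand-ins for `SdkTracerProvider` and the intercepted channel type, and the bodies are illustrative rather than the PR's code. Call sites use `TracerGuard`, `TracedChannel`, and `traced` unconditionally; only the definitions carry `cfg`.

```rust
/// Hypothetical stand-in for `SdkTracerProvider`; compiled only with `otel`.
#[cfg(feature = "otel")]
pub struct Provider;

#[cfg(feature = "otel")]
impl Provider {
    fn shutdown(&self) {
        // Flush and stop exporters (elided in this sketch).
    }
}

/// Callers hold this in every build; the field is the only gated part.
#[derive(Default)]
pub struct TracerGuard {
    #[cfg(feature = "otel")]
    provider: Option<Provider>,
}

impl Drop for TracerGuard {
    fn drop(&mut self) {
        // With the feature off this body is empty, so `main` needs no
        // `if let Some(provider) = ...` around shutdown.
        #[cfg(feature = "otel")]
        {
            if let Some(provider) = self.provider.take() {
                provider.shutdown();
            }
        }
    }
}

/// Hypothetical wrapper standing in for the intercepted channel type.
#[cfg(feature = "otel")]
pub struct Injected<C>(pub C);

#[cfg(feature = "otel")]
pub type TracedChannel<C> = Injected<C>;
#[cfg(not(feature = "otel"))]
pub type TracedChannel<C> = C;

/// Wraps the channel when `otel` is on.
#[cfg(feature = "otel")]
pub fn traced<C>(channel: C) -> TracedChannel<C> {
    Injected(channel)
}

/// Identity when `otel` is off.
#[cfg(not(feature = "otel"))]
pub fn traced<C>(channel: C) -> TracedChannel<C> {
    channel
}
```

With the feature off, `traced(channel)` returns its argument unchanged, which is what "collapses to identity" amounts to at the type level.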
iroh's internal `EndpointMetrics` are bridged into the existing
`prometheus-client` registry exposed at `/metrics`. The cli's `otel` feature
also enables `tonic-iroh-transport/metrics`, and `serve` attaches
`endpoint.metrics()` (clone of Arcs into live storage) via
`MetricsBundle::with_iroh`. The HTTP handler emits prometheus-client text
followed by iroh's OpenMetrics text in one well-formed response with a single
`# EOF` terminator. Verified end-to-end: `endpoint_socket_send_ipv4_total`
etc. show up alongside `hellas_*` counters.
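The single-terminator response can be sketched as below. This is an assumption about the shape of the handler, not its actual code: `merge_openmetrics` is a hypothetical helper, and `hellas_jobs_total` in the usage note is a made-up metric name. The point is only that any `# EOF` line from either encoder is dropped and exactly one is appended at the end.

```rust
/// Joins two OpenMetrics text expositions into one body that ends with a
/// single `# EOF` line. Hypothetical helper for illustration.
fn merge_openmetrics(first: &str, second: &str) -> String {
    let mut out = String::new();
    for part in [first, second] {
        for line in part.lines() {
            // Drop each encoder's own terminator; one is added at the end.
            if line.trim() == "# EOF" {
                continue;
            }
            out.push_str(line);
            out.push('\n');
        }
    }
    out.push_str("# EOF\n");
    out
}
```

For example, merging `"hellas_jobs_total 3\n# EOF\n"` with `"endpoint_socket_send_ipv4_total 7\n# EOF\n"` yields both samples followed by one terminator.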
Switch all TLS to rustls so the crate compiles in weird places (wasm,
cross-compile, no system openssl):
- Workspace `tonic-iroh-transport`: drop `["otel", "native-defaults"]`,
pin to v0.9.2, use granular features `["tls-ring", "portmapper",
"fast-apple-datapath"]`. v0.9.2 exposes the new passthroughs.
- Workspace `reqwest`: switch to `["rustls", "webpki-roots"]` (was
`["rustls-native-certs"]`, which lacks an actual TLS provider — the cause
of the `"invalid URL, scheme is not http"` symptom against
jaeger.lsd-ag.ch).
- `opentelemetry-otlp`: add `"reqwest-rustls-webpki-roots"` so its internal
`reqwest 0.12` (separate from our 0.13) gets a TLS provider too.
- `crates/executor/Cargo.toml`: `hf-hub = "0.5"` was secretly pulling
`native-tls` -> `openssl-sys` via default features. Pin to
`default-features = false, features = ["ureq"]`, matching `hellas-rpc`.
- Drop `pkgs.openssl` from `nix/default.nix`, `nix/docker.nix`,
`nix/package.nix`. `ldd target/debug/hellas-cli` is now empty for
`libssl`/`libcrypto` in both default and `candle,otel` builds.
Dev workflow:
- `rust-analyzer.toml` at workspace root pins RA's feature set to
`["candle", "otel"]` so type-checking covers gated modules across
editors. Replaces the abandoned `HELLAS_FEATURES` env var / cargo shim
approach.
- `nix/default.nix:88`: `hellas-run` wrapper drops
`--features "${HELLAS_FEATURES:-candle}"` in favor of explicit
`--features candle`.
Build matrix: all four cli feature combos compile (`{}`, `candle`, `otel`,
`candle,otel`); workspace check + clippy clean; HTTPS OTLP connect now
succeeds (DNS to jaeger.lsd-ag.ch is environmental).
- expose individual check-{fmt,clippy,sort,test} apps for matrix dispatch
- add `cargo test --workspace` check (default features)
- workflow enumerates check-* apps from the flake, runs each on
`[self-hosted, shared]`; gates with `CI passed` aggregate job
- opt-in `cache.hellas.ai` substituter via flake.nix nixConfig
writeShellApplication strips PATH down to runtimeInputs only, so cargo's default linker invocation (`cc`) was failing on the runner.
Single source of truth in nix/ci.nix is now an attrset
{ name -> { check, fix? } }. Exposed flat as `.#ci.<system>.commands`
for the GitHub Actions matrix to enumerate. CI runs each command via
`nix develop -c` so the dev shell is the runtime environment — no
more per-check writeShellApplication wrappers with hand-curated input
lists.
Workflow gains a `devshell` warmup job (clean failure surface for
env issues) and drops `--accept-flake-config` (runner daemon already
trusts cache.hellas.ai; flake nixConfig is for downstream users).
`nix run .#fix` previously fell back to running the check command for entries without a `fix` field — so it ran cargo test (slow) and cargo outdated (heuristic) during fix mode. Now those are filtered out entirely; fix only runs entries with an explicit fix variant.
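The selection rule is easier to see outside Nix. The following models the `{ name -> { check, fix? } }` shape in Rust purely for illustration: `Command`, `check_plan`, and `fix_plan` are hypothetical names, and the command strings in the usage note are examples rather than the contents of `nix/ci.nix`.

```rust
/// Hypothetical model of one entry in the `{ name -> { check, fix? } }` set.
struct Command {
    name: &'static str,
    check: &'static str,
    fix: Option<&'static str>,
}

/// Check mode runs every entry's `check` command.
fn check_plan(cmds: &[Command]) -> Vec<(&'static str, &'static str)> {
    cmds.iter().map(|c| (c.name, c.check)).collect()
}

/// Fix mode runs only entries with an explicit `fix`. Entries without one
/// are skipped; there is no fallback to `check`.
fn fix_plan(cmds: &[Command]) -> Vec<(&'static str, &'static str)> {
    cmds.iter()
        .filter_map(|c| c.fix.map(|f| (c.name, f)))
        .collect()
}
```

Given a `fmt` entry with `fix: Some("cargo fmt")` and a `test` entry with `fix: None`, `fix_plan` returns only the `fmt` entry, so the slow test run never happens in fix mode.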
Mechanical changes from `nix run .#fix`: rustfmt across the workspace, cargo-sort across all Cargo.toml files, and clippy's --fix for the auto-resolvable lints (collapsible if/let chains).
EnqueueError / StartExecutionError now wrap ExecuteJob in Box (was ~232 bytes inline). PreparedRoute / OpaquePreparedRoute box the RemoteDirect variant which held a ~1KB RemoteExecution. The remaining variant-size disparity in the route enums is annotated with `#[allow(clippy::large_enum_variant)]` — the variants are heterogeneous by nature and the enum lives only briefly during execution setup.
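The effect of boxing a large variant can be checked with `size_of`. The types below are hypothetical stand-ins (a 1 KiB byte array in place of `RemoteExecution`, and a `u32` for the small variant), so the numbers illustrate the mechanism rather than the real enums.

```rust
use std::mem::size_of;

/// Stand-in for a large payload such as the ~1KB `RemoteExecution`.
struct BigPayload([u8; 1024]);

/// Inline: every value of the enum is as large as its biggest variant.
enum RouteInline {
    Local(u32),
    RemoteDirect(BigPayload),
}

/// Boxed: the large payload moves to the heap; the enum holds a pointer.
enum RouteBoxed {
    Local(u32),
    RemoteDirect(Box<BigPayload>),
}

/// Returns (inline size, boxed size) in bytes for comparison.
fn route_sizes() -> (usize, usize) {
    (size_of::<RouteInline>(), size_of::<RouteBoxed>())
}
```

The inline enum is at least 1024 bytes, while the boxed one is a tag plus a pointer. The cost is one heap allocation when the large variant is constructed, which is usually acceptable for error and setup paths.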
After `CI passed`, on push only, build the slow targets in parallel:

  static-x86_64    cross.x86_64-linux-musl.cli
  static-aarch64   cross.aarch64-linux-musl.cli
  docker-cpu       default cpu image
  docker-cuda      alias of docker-cuda12-sm89 (new)

Matrix is driven from `.#ci.<sys>.builds` (same data-driven pattern as `commands`). `Extended builds passed` is a separate gate from `CI passed` so branch protection can require them independently.
No description provided.